Block Analysis of Bilingual Corpus for Chinese-English Statistical Machine Translation
Authors
Abstract
In this paper, we describe block analysis, a strategy for processing bilingual corpora, from a new point of view. With this strategy, we aim to extract more information from a bilingual corpus for future statistical machine translation. First, we define several block types and report statistics from a Chinese-English bilingual corpus under this framework. We then present a block-based alignment algorithm that extracts and aligns the corresponding bilingual blocks automatically. Experimental results show that block analysis is practical and more informative than word-based approaches.
Similar resources
Pre-processing of Bilingual Corpora for Mandarin-English EBMT
Pre-processing of bilingual corpora plays an important role in Example-Based Machine Translation (EBMT) and Statistical-Based Machine Translation (SBMT). For our Mandarin-English EBMT system, pre-processing includes segmentation for Mandarin, bracketing for English and building a statistical dictionary from the corpora. We used the Mandarin segmenter from the Linguistic Data Consortium (LDC). I...
Building A Case-based Semantic English-Chinese Parallel Treebank
We construct a case-based English-to-Chinese semantic constituent parallel Treebank for a Statistical Machine Translation (SMT) task by labelling each node of the Deep Syntactic Tree (DST) with our refined semantic cases. Since subtree span-crossing is harmful in tree-based SMT, DST is adopted to alleviate this problem. At the same time, we tailor an existing case set to represent bili...
The TCH machine translation system for IWSLT 2008
This paper reports on the first participation of TCH (Toshiba (China) Research and Development Center) in the IWSLT evaluation campaign. We participated in all five translation tasks with Chinese as the source or target language. For Chinese-English and English-Chinese translation, we used hybrid systems that combine the rule-based machine translation (RBMT) method and statistical machine tra...
Statistical Analysis of Alignment Characteristics for Phrase-based Machine Translation
In most statistical machine translation (SMT) systems, bilingual segments are extracted via word alignment. However, there is a lack of systematic study of which alignment characteristics benefit MT under specific experimental settings such as the language pair or the corpus size. In this paper we produce a set of alignments by directly tuning the alignment model according to alignment F-score a...
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011
This report describes SYSTRAN’s Chinese-English and English-Chinese machine translation systems that participated in the CWMT 2011 machine translation evaluation tasks. The base systems are SYSTRAN rule-based machine translation systems, augmented with various statistical techniques. Based on the translations of the rule-based systems, we performed statistical post-editing with the provided bili...